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Abstract 

Given  an  m  x  n  matrix  A  we  are  interested  in  applying  it  to  a  real  vector  x  £  R™  in  less  then  the 
straightforward  O(mn)  time.  For  an  exact,  deterministic  computation  at  the  very  least  all  entrees  in  A 
must  be  accessed,  requiring  0(mn)  operations  and  matching  the  running  time  of  naively  applying  A  to 
x.  However,  we  claim  that  if  the  matrix  contains  only  a  constant  number  of  distinct  values,  then  reading 
the  matrix  once  in  0(mn)  steps  is  sufficient  to  preprocess  it  such  that  any  subsequent  application  to 
vectors  requires  only  0(mn/  \og(max{m,n}))  operations.  Theoretically  our  algorithm  can  improve  on 
recent  results  for  dimensionality  reduction  and  practically  it  is  useful  (faster)  even  for  small  matrices. 


Introduction 

A  classical  result  of  Winograd  ([1])  shows  that  general  matrix- vector  multiplication  requires  Q(mn)  opera¬ 
tions,  which  matches  the  running  time  of  the  naive  algorithm.  However,  since  matrix-vector  multiplication 
is  such  a  common,  basic  operation,  an  enormous  amount  of  effort  has  been  put  into  exploiting  the  special 
structure  of  matrices  to  accelerate  it.  For  example,  Fourier,  Walsh  Hadamard,  Toeplitz,  Vandermonde,  and 
others  can  be  applied  to  vectors  in  0(npolylog(n)). 

Others  have  focussed  on  matrix-matrix  multiplication.  For  two  n  x  n  binary  matrices  the  historical  Four 
Russians  Algorithm  [2]  (modified  in  [3])  gives  a  log  factor  improvement  over  the  naive  algorithm,  i.e,  running 
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time  of  /login).  Techniques  for  saving  log  factors  in  real  valued  matrix-matrix  multiplications  were  also 
found.  These  include  Santoro  and  Urrutia  [4]  and  Savage  [5].  These  methods  are  reported  to  be  practically 
more  efficient  then  naive  implementations.  Classic  theoretical  results  by  Strassen  [6]  and  Coppersmith  and 
Winograd  [7]  achieve  better  results  then  the  above  mentioned  but  are  slower  then  the  naive  implementation 
for  small  and  moderately  sized  matrices.  The  above  methods,  however,  do  not  extend  to  matrix-vector 
multiplication. 

We  return  to  matrix-vector  operations.  Since  every  entry  of  the  matrix,  A,  must  be  accessed  at  least  once, 
we  consider  a  preprocessing  stage  which  takes  fi(ran)  operations.  After  the  preprocessing  stage  x  is  given 
and  we  seek  an  algorithm  to  produce  the  product  Ax  as  fast  as  possible.  Within  this  framework  Williams 
[8]  showed  that  an  n  x  n  binary  matrix  can  be  preprocessed  in  time  0(n2+e)  and  subsequently  applied  to 
binary  vectors  in  0(n2 /e  log2  n).  Williams  also  extends  his  result  to  matrix  operations  over  finite  semirings. 
However,  to  the  best  of  the  authors’  understanding,  his  technique  cannot  be  extended  to  real  vectors. 

In  this  manuscript  we  claim  that  any  m  x  n  matrix  over  a  finite  alphabet 1  can  be  preprocessed  in  time 
0(mn)  such  that  it  can  be  applied  to  any  real  vector  x  €  R"  in  0(mn/  log(max{m,  n}))  operations.  Such 
operations  are  common,  for  example,  in  spectral  algorithms  for  unweighted  graphs,  nearest  neighbor  searches, 
dimensionality  reduction,  and  compressed  sensing.  Our  algorithm  also  achieves  all  previous  results  (excluding 
that  of  Williams)  while  using  a  strikingly  simple  approach. 


The  Mailman  algorithm 

Intuitively,  our  algorithm  multiplies  A  by  x  in  a  manner  that  brings  to  mind  the  way  a  mailman  distributes 
letters,  first  sorting  the  letters  by  house  and  then  delivering  them.  Recalling  the  identity  Ax  =  i  A^x(i), 
metaphorically  one  can  think  of  each  column  A.W  as  indicating  one  “address”  or  “house”,  and  each  entry  x(i) 
as  a  letter  addressed  to  it.  To  continue  the  metaphor,  imagine  that  computing  and  adding  the  term  A^x(i) 
to  the  sum  is  equivalent  to  the  effort  of  walking  to  house  A^  and  delivering  x{i).  From  this  perspective,  the 
naive  algorithm  functions  by  delivering  each  letter  individually  to  its  address,  regardless  of  how  far  the  walk 
is  or  if  other  letters  are  going  to  the  same  address.  Actual  mailmen,  of  course,  know  much  better.  First, 
they  arrange  their  letters  according  to  the  shortest  route  (which  includes  all  houses)  without  moving;  then 
they  walk  the  route,  visiting  each  house  regardless  of  how  many  letters  should  be  delivered  to  it  (possibly 
none). 

1A(i,j)  £  E,  and  |E|  is  a  finite  constant. 
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To  extend  this  idea  to  matrix-vector  multiplication,  our  algorithm  decomposes  A  into  two  matrices,  P 
and  U,  such  that  A  =  UP.  P  is  the  ’’address-letter”  correspondence  matrix.  Applying  P  to  x  is  analogous  to 
arranging  the  letters.  U  is  the  matrix  containing  all  possible  columns  in  A.  Applying  U  to  (Px)  is  analogous 
to  walking  the  route.  Hence,  we  name  our  algorithm  after  the  age-old  wisdom  of  the  men  and  women  of  the 
postal  service. 

Our  idea  can  be  easily  described  for  an  in  x  n  matrix  A,  where  m  =  log2(rc)  (w.l.o.g  assume  log2(n)  is 
an  integer)  and  A(i,j)  £  {0, 1}.  (Later  we  shall  modify  it  to  include  other  sizes  and  possible  entrees.)  There 
are  precisely  2m  =  n  possible  columns  of  the  matrix  A,  by  construction,  since  each  of  the  m  column  entrees 
can  be  only  0  or  1.  Now,  define  the  matrix  Un  as  the  matrix  containing  all  possible  columns  over  {0, 1}  of 
length  log(n).  By  definition  Un  contains  all  the  columns  that  appear  in  A.  More  precisely,  for  any  column 
xb'd  there  exists  an  index  1  <  i  <  n  such  that  A^A  =  Un\  We  can  therefore  define  an  n  x  n  matrix,  P, 
such  that  A  =  UP.  In  particular  the  matrix  P  contains  one  1  entree  per  column  such  that  P(i,j)  =  1  if 

AW  =  u&\ 


Applying  U 


Denote  by  Un  the  log2(n)  x  n  matrix  containing  all  possible  strings  over  {0,1}  of  length  log(n).  Notice 
immediately  that  Un  can  be  constructed  using  Un/-2  as  follows: 


Ui  =  (0  1),  Un 


0,...,0  1,...,1 

Un/2  Un/2 


Applying  Un  to  any  vector  z  requires  less  then  3n  operations,  which  can  be  shown  by  dividing  z  into  its  first 
and  second  halves,  z±  and  z2,  and  computing  the  product  Unz  recursively. 


(  0 . 0 

\  Un/2 


o  •  Y  zi  + 1  •  S  z2 

Un/2{Z\  +  Z2) 


Computing  the  vector  z\  +  z2,  and  the  sums  Z\  and  z2  requires  3n/2  operations.  We  get  the  recurrence 
relation 


T( 2)  =  2,  T(n)  =  T(n/ 2)  +  3n/2  =>  T(n)  <  3 n. 


The  sum  over  z\  is  of  course  redundant  here.  However,  it  would  not  have  been  if  the  entrees  in  U  were 
{— 1, 1}  for  example. 
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Constructing  P 

Since  U  is  applied  implicitly  (not  stored)  we  are  only  concerned  with  constructing  P.  An  entry  P(i,j)  is  set 
to  1  if  A td  =  IJn  ’ .  Notice  that  by  our  construction  of  Un  its  i’th  column  encodes  the  binary  value  of  the 
index  i.  Thus,  if  a  column  of  A  contains  the  binary  representation  of  the  value  i,  it  is  equal  to  We 

therefore  construct  P  by  reading  A  and  setting  P(i,j)  to  1  if  A^  represents  the  value  i.  This  clearly  can 
be  done  in  time  0{mn),  the  size  of  A.  Since  P  contains  only  n  nonzero  entries  (one  for  each  column  of  A), 
it  can  be  applied  in  n  steps. 

Although  A  is  of  size  log(n)  x  n,  it  can  be  applied  in  linear  time  (O(n))  and  a  log(n)  factor  is  gained.  If 
the  number  of  rows,  to,  in  A  is  more  then  log(n)  we  can  section  A  into  at  most  |"TO/Tog(n)]  matrices  each 
of  size  at  most  log(n)  x  n.  Since  each  of  the  sections  can  be  applied  in  0(n )  operations,  the  entire  matrix 
can  be  applied  in  0(mn/  log(n))  operations. 


Remarks 

Remark  1  ( n  <  to)  Notice  that  if  n  <  to  we  can  section  A  vertically  to  sections  of  size  to  x  [log(TO)]  and 
argue  very  similarly  that  PTUT  can  he  applied  in  linear  time.  Thus  a  log(m)  saving  factor  in  running  time 
is  achieved. 

Remark  2  (Constant  sized  alphabet)  The  definition  ofUn  can  be  changed  so  that  it  encodes  all  possible 
strings  over  a  larger  alphabet  E,  |E|  =  S  . 

(ci, . . . ,  o\  ...  oy, . . .  ,<Je 
Un/S  ■■■  Un/s 

The  matrix  Un  of  size  logs(n)  x  n  encodes  all  possible  strings  of  length  logs(n)  over  S  and  can  be  applied  in 
linear  time  0(n).  If  the  number  of  rows  in  A  is  larger,  we  divide  it  into  sections  of  size  log|S|(rc)  x  n.  The 
total  running  time  therefore  becomes  0(TOnlog(S')/log(max{TO,  n}). 

Remark  3  (Matrix  operations  over  semirings)  Suppose  we  supply  E  with  an  associative  and  commu¬ 
tative  addition  operation  (+),  and  a  multiplication  operation  (•)  that  distributes  over  (+).  If  both  A  and  x 
are  chosen  from  E  the  matrix  vector  multiplication  over  the  semiring  {E,+,  •}  can  be  performed  using  the 
exact  same  algorithm.  This  is,  of  course,  not  a  new  result.  However,  it  includes  all  other  log-factor-saving 
results. 
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Dimensionality  reduction 


Assume  we  are  given  p  points  {x\, . . . ,  xp}  in  Rn  and  are  asked  to  embed  them  into  Rm  such  that  all  distances 
between  points  are  preserved  up  to  distortion  e  and  m  <C  n.  A  classic  result  by  Johnson  and  Lindenstrauss 

[9]  shows  that  this  is  possible  for  m  =  ©(log (p)/e2).  The  algorithm  first  chooses  a  random  m  x  n  matrix, 
A,  from  a  special  distribution.  Then,  each  of  the  vectors  (points)  {xi, . . .  ,xp}  are  multiplied  by  A.  The 
embedding  x^  — ■>  A;c,  can  be  shown  to  have  the  desired  property  with  at  least  constant  probability. 

Clearly  a  naive  application  of  A  to  each  vector  requires  0(mri)  operations.  Recently  Ailon  and  Liberty 

[10] ,  modifying  the  work  of  Ailon  and  Chazelle  [11],  allows  A  to  be  applied  in  0(nlog(?n))  operations.  We 
now  claim  that,  in  some  situations,  an  older  result  by  Achlioptas  can  be  (trivially)  modified  to  out  preform 
this  last  result.  In  particular,  Achlioptas  [12]  showed  that  the  entries  of  A  can  be  chosen  i.i.d  from  {—1,  0, 1} 
with  some  constant  probabilities.  Using  our  method  and  Achlioptas’s  matrix,  one  can  apply  A  to  each  vector 
in  0(nlog(p) /  log(n)e2)  operations  (remainder  m  =  0(\og(p)  /  £2)).  Therefore,  if  p  is  polynomial  in  n  then 
log(?i)  =  0(log(p))  and  applying  A  to  Xi  requires  only  0{n/e2)  operations.  For  a  constant  e  this  outperforms 
the  best  known  0(nlog(m))  and  matches  the  lower  bound.  Notice  that  the  dependance  on  £  is  l/£2  instead 
of  log(l/£)  which  makes  this  result  actually  much  slower  for  most  practical  purposes. 


Experiments 

Here  we  compare  the  running  time  of  applying  a  log(n )  x  n  ,  {0, 1}  matrix  A  to  a  vector  of  floating  point 
variables  x  €  R"  using  three  methods.  The  first  is  a  naive  implementation  in  C.  This  simply  consists  of  two 
nested  loops  ordered  with  respect  to  memory  allocation.  The  second  is  an  optimized  matrix  vector  code. 
We  test  ourselves  against  LAPACK  which  uses  BLAS  subroutines.  The  third  is,  of  course,  the  mailman 
algorithm  following  the  preprocessing  stage. 

Although  the  complexity  of  the  first  two  methods  is  0(n  log  (n))  and  that  of  the  mailman  algorithm  is 
O(n)  the  improvement  factor  is  well  below  log(n).  This  is  because  the  memory  access  pattern  in  applying 
P  is  problematic  at  best,  and  creates  many  cache  faults.  Bare  in  mind  that  these  results  might  depend  on 
memory  speed  and  size  vs.  CPU  speed  of  the  specific  machine.  The  experiment  below  denotes  the  time 
required  for  the  actual  multiplication  not  including  memory  allocation.  (Experiments  were  conducted  on  an 
Intel  Pentium  M  1.4Ghz  processor,  598Mhz  Bus  speed  and  512Mb  of  RAM.) 

As  seen  in  figure  1  ,  the  mailman  algorithm  operates  faster  then  the  naive  algorithm  even  for  4  x  16 
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Figure  1:  Running  time  of  three  different  algorithms  for  matrix  vector  multiplication.  Naive  is  a  nested 
loop  written  in  C.  LAPACK  is  a  machine  optimized  code.  The  mailman  algorithm  is  described  above.  The 
matrices  were  of  size  m  (on  the  x  axis)  by  2m  and  they  were  multiplied  by  real  (double  precision)  values 
vectors.  The  right  figure  plots  the  running  times  relative  to  those  of  the  naive  algorithm. 

matrices.  It  also  outperforms  the  LAPACK  procedure  for  most  matrix  sizes.  The  machine  specific  optimized 
code  (LAPACK)  is  superior  when  the  matrix  row  allocation  size  approaches  the  memory  page  size.  A  machine 
specific  optimized  mailman  algorithm  might  take  advantage  of  the  same  phenomenon  and  outperform  the 
LAPACK  on  those  values  as  well. 


Concluding  remark 

It  has  been  known  for  a  long  time  that  a  log  factor  can  be  saved  in  matrix-vector  multiplication  when 
the  matrix  and  the  vector  are  over  constant  size  alphabets.  In  this  paper  we  described  an  algorithm  that 
achieves  this  while  also  dealing  with  real-valued  vectors.  As  such  the  idea  by  itself  is  neither  revolutionary 
nor  complicated,  but  it  is  useful  in  current  contexts.  We  showed,  as  a  simple  application  of  it,  that  random 
projections  can  be  achieved  asymptotically  faster  then  the  best  currently  known  algorithm,  provided  the 
number  of  projected  points  is  polynomial  in  their  original  dimension.  Moreover,  we  saw  that  our  algorithm 
is  practically  advantageous  even  for  small  matrices. 
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